De-Duplication Scheduling Strategy in Real-Time Data Warehouse
Abstract
The data quality of a data warehouse is crucial to decision-makers, and data duplication is one of the critical factors that affect it. Data de-duplication is therefore an essential process in data warehousing. For a real-time data warehouse in particular, it is necessary to ensure not only data quality in real time but also the performance of front-end queries and analysis, so the scheduling strategy for de-duplication deserves careful study. In this paper, we first investigate three kinds of data de-duplication scheduling strategies, namely the De-duplication Prior scheduling Strategy (DPS), the Real-time scheduling Strategy (RS) and the ETL Prior scheduling Strategy (EPS); we then propose a new Time-Triggered scheduling Strategy (TTS), which belongs to EPS; finally, we evaluate the performance of the proposed strategy through experiments. This work contributes to efficient data cleaning and to the application of real-time data warehouses.
Similar resources
Improvement of the Analytical Queries Response Time in Real-Time Data Warehouse using Materialized Views Concatenation
A real-time data warehouse is a collection of recent and hierarchical data that managers use for decision-making by creating online analytical queries. The volume of data collected from data sources and entered into the real-time data warehouse is constantly increasing. Moreover, as the volume of input data to the real-time data warehouse increases, the interference between online loading ...
Green Energy-aware task scheduling using the DVFS technique in Cloud Computing
Nowadays, energy consumption has become a critical issue in high-performance distributed computing systems, so green computing tries to reduce energy consumption, carbon footprint and CO2 emissions in high-performance computing systems (HPCs) such as clusters, Grid and Cloud that a large number of parallel. Reducing energy consumption for high-end computing can bring various benefits such as red...
Predicting Maximum Data Staleness in Real-Time Warehouses
This paper presents an analysis technique for estimating maximum data staleness in a data warehouse that collects “near-real-time” data streams. Data is pushed to the warehouse from a variety of external sources with a wide range of inter-arrival times (e.g., once a minute to once a day). In prior work, ad hoc heuristic algorithms have been proposed for scheduling warehouse updates. In this pap...
Real-time Scheduling of a Flexible Manufacturing System using a Two-phase Machine Learning Algorithm
The static and analytic scheduling approach is very difficult to follow and is not always applicable in real time. Most scheduling algorithms are designed for an offline environment. However, real cases present three challenges: First, problem data of jobs are not known in advance. Second, most of the shop’s parameters tend to be stochastic. Third, th...
Integrated Order Batching and Distribution Scheduling in a Single-block Order Picking Warehouse Considering S-Shape Routing Policy
In this paper, a mixed-integer linear programming model is proposed to integrate batch picking and distribution scheduling problems in order to optimize them simultaneously in an order picking warehouse. A two-phase heuristic algorithm is presented to solve it in reasonable time. The first phase uses a genetic algorithm to evaluate and select permutations of the given set of customers. The seco...